In this two-part bootcamp, we’re going to work through the steps of RNA-seq data analysis. We’ll start by visualizing and exploring the raw data, then move on to some basic quality-control steps to produce cleaned and normalized expression counts, and finally end with some downstream analyses.
Part 1 covers the exploration and quality control of the raw data.
Start of Part 1
This is the base-level package we’ll need to view and manipulate our data. We’ll load the rest of the packages as we come to the functions that require them.
library(tidyverse)
The raw sequence data that you get directly from the sequencing machine comes as a FASTQ file. It contains the unmapped sequencing reads, with the information for each read stored in the following format:
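For reference, each read in a FASTQ file takes up four lines: an @-prefixed identifier, the base sequence, a + separator, and the per-base quality scores (Phred scores encoded as ASCII characters). A made-up example record:

```
@SEQ_001 example read
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
```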
These sequence-read files now need to be mapped to the genome. The process of mapping or alignment results in mapped SAM or BAM files.
SAM: human-readable
BAM: binary format of a SAM file (computer-readable)
To generate your expression counts file, you need a BAM file. Usually, when you send samples for sequencing, the center will send you back the raw FASTQ files as well as the aligned BAM files.
The steps of FASTQ alignment / BAM generation generally include:
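As a rough sketch of those steps (assuming FastQC, STAR, and samtools are installed; all file names here are placeholders):

```bash
# 1. Read-level QC on the raw FASTQ files
fastqc sample_R1.fastq.gz sample_R2.fastq.gz

# 2. Align reads to the reference genome (STAR shown here; the genome index is built beforehand)
STAR --genomeDir star_index/ \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix sample_

# 3. Index the sorted BAM so downstream tools can access it efficiently
samtools index sample_Aligned.sortedByCoord.out.bam
```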
There are several software packages available for performing the above steps in a Unix environment, each with its own features that might be better suited to different data types:
There are many tutorials available online that work through the steps of alignment: https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2019/Introduction/SS_DB/Materials/Practicals/Practical2_alignment_JK.html
Okay, once you have your BAM files, extracting the counts for your reads (i.e. how many times was a particular transcript sequenced) is pretty easy and straightforward, and you can do it in R itself.
library(Rsubread)
library(here)
##an example BAM file from the 1000 Genomes Project
aligned <- featureCounts("data/HG00097.mapped.ILLUMINA.bwa.GBR.exome.20130415.bam", annot.inbuilt = "hg19", isPairedEnd = TRUE)
##expression counts
expression_counts <- aligned$counts
An example output from running the featureCounts function:
In this bootcamp, we can directly load the formatted data of
expression counts and phenotypes from the .txt files.
eDat <- read.delim(normalizePath("data/GSE157103_formatted_eDat.txt"), sep = "\t")
pDat <- read.delim(normalizePath("data/GSE157103_formatted_pDat.txt"), sep = "\t")
##several methods to view dataframes
library(kableExtra, quietly = TRUE)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
as_tibble(pDat)
## # A tibble: 126 × 16
## ID Sample Age Sex COVID ICU APACH…¹ Charl…² Mecha…³ Venti…⁴ Hospi…⁵
## <chr> <chr> <int> <chr> <chr> <chr> <int> <int> <chr> <int> <int>
## 1 C1 COVID_… 39 male yes no 15 0 yes 0 0
## 2 C2 COVID_… 63 male yes no 0 2 no 28 39
## 3 C3 COVID_… 33 male yes no 0 2 no 28 18
## 4 C4 COVID_… 49 male yes no 0 1 no 28 39
## 5 C5 COVID_… 49 male yes no 19 1 yes 23 27
## 6 C6 COVID_… 0 male yes no 0 1 no 28 36
## 7 C7 COVID_… 38 fema… yes no 0 7 no 28 42
## 8 C8 COVID_… 78 male yes yes 43 7 yes 0 0
## 9 C9 COVID_… 64 fema… yes yes 31 2 yes 0 0
## 10 C10 COVID_… 62 male yes yes 34 1 yes 2 0
## # … with 116 more rows, 5 more variables: Ferritin_ng.ml <int>, CRP_mg.l <dbl>,
## # Procalcitonin_ng.ml <dbl>, Lactate_mmol.l <dbl>, Fibrinogen_mg.dL <int>,
## # and abbreviated variable names ¹APACHEII_Score, ²Charlson_Score,
## # ³Mechanical_Ventilation, ⁴Ventilator_free_days,
## # ⁵Hospital_free_days_post_45_days
##equivalent to using the pipeline command
# pDat %>%
# as_tibble()
# pDat %>%
# kable() %>%
# kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, fixed_thead = T)
# str(pDat)
# as_tibble((eDat[1:10, 1:10]))
##converting the column with all the gene names into assigned row names --
##for a few downstream analyses, we require certain identifiers (such as gene/sample
##names) as column and row names, and not as a separate column or row
eDat <- eDat %>%
column_to_rownames(var = "gene")
##equivalent to
# rownames(eDat) <- eDat$gene
# eDat <- eDat[ ,-c(1)]
pDat <- pDat %>%
column_to_rownames(var = "ID")
##Do the column names of our expression dataframe (our sample names)
##match the order of the rows in the ID column of the phenotype dataframe?
all(colnames(eDat) == rownames(pDat))
## [1] TRUE
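As a quick illustration of why this check matters (on toy data, not our actual eDat/pDat): if the orders ever disagree, the expression columns can be reordered by the phenotype row names.

```r
##toy expression and phenotype objects with mismatched sample ordering
e_toy <- data.frame(S2 = c(1, 2), S1 = c(3, 4), row.names = c("geneA", "geneB"))
p_toy <- data.frame(Age = c(30, 40), row.names = c("S1", "S2"))
all(colnames(e_toy) == rownames(p_toy)) #FALSE - orders disagree
e_toy <- e_toy[, rownames(p_toy)]       #reorder columns to match the phenotype rows
all(colnames(e_toy) == rownames(p_toy)) #TRUE
```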
##column names of pDat
colnames(pDat)
## [1] "Sample" "Age"
## [3] "Sex" "COVID"
## [5] "ICU" "APACHEII_Score"
## [7] "Charlson_Score" "Mechanical_Ventilation"
## [9] "Ventilator_free_days" "Hospital_free_days_post_45_days"
## [11] "Ferritin_ng.ml" "CRP_mg.l"
## [13] "Procalcitonin_ng.ml" "Lactate_mmol.l"
## [15] "Fibrinogen_mg.dL"
##what's the separation of patients by sex?
pDat %>%
dplyr::count(COVID, Sex)
## COVID Sex n
## 1 no female 13
## 2 no male 12
## 3 no unknown 1
## 4 yes female 38
## 5 yes male 62
##what's the ICU status by COVID status?
pDat %>%
dplyr::count(COVID, ICU)
## COVID ICU n
## 1 no no 10
## 2 no yes 16
## 3 yes no 50
## 4 yes yes 50
##we can't do that for age, as it is a continuous variable.
##making a new categorical variable from a continuous
pDat <- pDat %>%
mutate(Age_bracket = case_when(
Age <= 20 ~ "below_20", #<= 20, so that an age of exactly 20 isn't left uncategorized
between(Age, 21, 40) ~ "21-40",
between(Age, 41, 60) ~ "41-60",
Age > 60 ~ "above_60"
))
##treat categorical variables as factors
pDat$Age_bracket <- as.factor(pDat$Age_bracket)
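For what it’s worth, base R’s cut() can build the same brackets in one call (a sketch on made-up ages):

```r
ages <- c(15, 20, 35, 50, 75)
##intervals are right-closed by default, so (-Inf, 20] maps to "below_20", (20, 40] to "21-40", etc.
brackets <- cut(ages, breaks = c(-Inf, 20, 40, 60, Inf),
                labels = c("below_20", "21-40", "41-60", "above_60"))
brackets
```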
pDat %>%
count(COVID, Age_bracket)
## COVID Age_bracket n
## 1 no 21-40 3
## 2 no 41-60 6
## 3 no above_60 17
## 4 yes 21-40 14
## 5 yes 41-60 29
## 6 yes above_60 56
## 7 yes below_20 1
pDat %>%
ggplot(aes(x = Age_bracket, fill = Age_bracket)) +
geom_bar(stat = "count") +
facet_grid(~COVID)
##reorder levels
pDat$Age_bracket <- fct_relevel(pDat$Age_bracket, c("below_20", "21-40", "41-60", "above_60"))
##Let's now plot this distribution of our COVID patients by their age bracket
pDat %>%
ggplot(aes(x = Age_bracket, fill = Age_bracket)) +
geom_bar(stat = "count") +
theme_minimal() +
facet_grid(~COVID) +
labs(title = "Distribution of COVID Patients by Age Bracket")
##Now, let's check how protein levels vary by COVID status
##Making a new variable that only includes protein measurements
names(pDat)
## [1] "Sample" "Age"
## [3] "Sex" "COVID"
## [5] "ICU" "APACHEII_Score"
## [7] "Charlson_Score" "Mechanical_Ventilation"
## [9] "Ventilator_free_days" "Hospital_free_days_post_45_days"
## [11] "Ferritin_ng.ml" "CRP_mg.l"
## [13] "Procalcitonin_ng.ml" "Lactate_mmol.l"
## [15] "Fibrinogen_mg.dL" "Age_bracket"
proteins <- pDat %>%
dplyr::select(COVID, Ferritin_ng.ml, CRP_mg.l, Procalcitonin_ng.ml, Lactate_mmol.l, Fibrinogen_mg.dL)
##We're now going to collapse this `wide` dataframe into a `long` one
proteins <- proteins %>%
pivot_longer(cols = 2:6, names_to = "protein", values_to = "measurement")
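On a toy dataframe, the wide-to-long reshape looks like this (made-up columns x and y):

```r
library(tidyr)
##two measurement columns per id in wide format
wide <- data.frame(id = c("a", "b"), x = c(1, 2), y = c(3, 4))
##one row per id-measurement combination in long format
long <- pivot_longer(wide, cols = c(x, y), names_to = "var", values_to = "val")
long #4 rows
```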
##plotting the spread of each of the proteins by COVID status:
proteins %>%
ggplot(aes(x = COVID, y = measurement, fill = COVID)) +
geom_boxplot() +
# geom_jitter(shape = 16, colour = "grey", alpha = 0.5, width = 0.2) +
# scale_fill_manual(values = c("orange", "grey")) +
facet_wrap(~protein, scales = "free") + #scales = free allows the y-axis of each plot to have variable limits
theme_minimal()
##how about mechanical ventilation? what percentage of COVID patients needed mechanical ventilation?
pDat %>%
# filter(COVID == "yes") %>% #1
ggplot(aes(x = Mechanical_Ventilation, fill = Mechanical_Ventilation)) +
geom_bar(stat = "count", width = 0.6) +
scale_fill_manual(values = c("grey", "orange")) +
theme_minimal() +
facet_grid(~COVID) +
labs(y = "Counts of patients", title = "COVID Patients")
Let’s now look at the distribution of our expression data.
dim(eDat)
## [1] 19472 126
#19472 126
library(reshape2, quietly = TRUE) #similar functionality to pivot_longer
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
##melt is a function that condenses all of your data into just two columns -
##one with the character value and one with its corresponding numerical value
e_melt <- melt(eDat)
## No id variables; using all as measure variables
head(e_melt)
## variable value
## 1 C1 29.01
## 2 C1 0.00
## 3 C1 17.00
## 4 C1 3.00
## 5 C1 1.00
## 6 C1 0.00
colnames(e_melt)[1:2] <- c("sample", "expression")
e_melt %>%
ggplot(aes(x = log2(expression), color = sample, fill = sample)) +
geom_density(alpha = 0.1) +
theme_minimal() +
theme(legend.position = "none") + #has to come after specifying theme
labs(x = "log2RPM", y = "Density", title = "Sample Distribution - Density Plot", subtitle = "Raw Counts\n")
## Warning: Removed 544485 rows containing non-finite values (stat_density).
##We get this warning - Removed 544485 rows containing non-finite values (stat_density) -
##because some genes have an expression value of 0, and log2(0) is undefined (it
##evaluates to -Inf). Hence, we'll apply a log2(x+1) transformation, which adds 1 to
##every count before taking log2, converting the zero counts to log2(1) = 0.
e_melt <- e_melt %>%
mutate(log_x_1 = log2(expression + 1))
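A quick numeric check of what the transformation does:

```r
log2(0)              #-Inf: why zero counts were dropped from the first plot
log2(0 + 1)          #0: the +1 offset maps zero counts to zero
log2(c(0, 1, 7) + 1) #0 1 3: non-zero counts shift only slightly on the log scale
```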
##We'll store this plot in a variable
g1 <- e_melt %>%
ggplot(aes(log_x_1, color = sample, fill = sample)) +
geom_density(alpha = 0.1) +
theme_minimal() +
theme(legend.position = "none") + #has to come after specifying theme
labs(x = "log2(x+1) RPM", y = "Density", title = "Sample Distribution - Density Plot", subtitle = "Raw Counts\n")
g1
##Okay, now we have an idea about how our expression data looks. Let's now
##take a look at our samples. How do they correlate with each other?
samp_cor <- cor(eDat)
dim(samp_cor)
## [1] 126 126
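cor() computes pairwise correlations (Pearson, by default) between the columns of its input, which is why we get a 126 x 126 matrix back. A toy illustration:

```r
##three "samples", each measured three times
m <- cbind(s1 = c(1, 2, 3), s2 = c(2, 4, 6), s3 = c(3, 2, 1))
round(cor(m), 2) #s1 and s2 are perfectly correlated (1); s3 runs opposite (-1)
```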
##to visualize our correlations, we'll make a heatmap of the sample correlation values
library(pheatmap, quietly = TRUE)
h1 <- samp_cor %>%
pheatmap(clustering_distance_rows = "euclidean", clustering_distance_cols = "euclidean",
clustering_method = "complete", cluster_rows = TRUE, cluster_cols = TRUE,
show_colnames = FALSE, show_rownames = FALSE,
annotation_row = pDat[c("COVID", "Sex")], annotation_col = pDat[c("COVID", "Sex")],
main = "Sample Correlations"
)
##let's add some custom colours
annot_cols <- list(COVID = c(`yes` = "grey", `no` = "orange"),
Sex = c(`male` = "sea green", `female` = "purple", `unknown` = "yellow"))
h1 <- samp_cor %>%
pheatmap(clustering_distance_rows = "euclidean", clustering_distance_cols = "euclidean",
clustering_method = "complete", cluster_rows = TRUE, cluster_cols = TRUE,
show_colnames = FALSE, show_rownames = FALSE,
annotation_row = pDat[c("COVID", "Sex")], annotation_col = pDat[c("COVID", "Sex")],
annotation_colors = annot_cols,
main = "Sample Correlations"
)
We’ll explore a few different filtering criteria and decide on one, and then normalize our data.
#dim(eDat)
##removing sequences with RPM of 0 in all samples /
##keeping only sequences with RPM > 0 in at least 1 sample
e_fil1 <- eDat %>%
rownames_to_column(var = "gene") %>%
filter_at(vars(-gene), any_vars(. != 0)) %>% #superseded syntax; dplyr >= 1.0 equivalent: filter(if_any(-gene, ~ .x != 0))
column_to_rownames(var = "gene")
#dim(e_fil1)
#19472 - 18340
# 1132 sequences/genes removed
melt_fil1 <- melt(e_fil1)
## No id variables; using all as measure variables
melt_fil1 <- melt_fil1 %>%
mutate(log_x_1 = log2(value + 1))
g_fil1 <- melt_fil1 %>%
ggplot(aes(log_x_1, color = variable, fill = variable)) +
geom_density(alpha = 0.1) +
theme_minimal() +
theme(legend.position = "none") + #has to come after specifying theme
labs(x = "log2(x+1) RPM", y = "Density", title = "Sample Distribution - Density Plot", subtitle = "Removing Sequences with RPM of 0 in all samples", caption = "Raw counts")
g_fil1
##keeping only sequences with RPM >= 1 in all samples
e_fil2 <- eDat %>%
rownames_to_column(var = "gene") %>%
filter_at(vars(-gene), all_vars(. >= 1)) %>%
column_to_rownames(var = "gene")
#dim(e_fil2)
#19472 - 11860
#7612 sequences removed
melt_fil2 <- melt(e_fil2)
## No id variables; using all as measure variables
melt_fil2 <- melt_fil2 %>%
mutate(log_x_1 = log2(value + 1))
g_fil2 <- melt_fil2 %>%
ggplot(aes(log_x_1, color = variable, fill = variable)) +
geom_density(alpha = 0.1) +
theme_minimal() +
theme(legend.position = "none") + #has to come after specifying theme
labs(x = "log2(x+1) RPM", y = "Density", title = "Sample Distribution - Density Plot", subtitle = "Sequences with RPM >= 1 in all samples", caption = "Raw counts")
g_fil2
##keeping only sequences with RPM >= 1 in 30% of samples --> depends on groups in data
# e_fil4 <- eDat %>%
# filter(rowSums(across(where(is.numeric)) >= 1) > 38)
RPM_morethan1 <- eDat >= 1
##This gives a matrix of boolean values. Now, to count for each gene the number of
##samples in which it has an RPM >= 1 (i.e., the number of TRUEs in each row):
table(rowSums(RPM_morethan1))
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 1172 369 223 150 122 123 101 85 90 82 71 68 55
## 13 14 15 16 17 18 19 20 21 22 23 24 25
## 47 48 46 45 49 58 30 49 45 49 47 32 44
## 26 27 28 29 30 31 32 33 34 35 36 37 38
## 32 39 35 35 38 30 41 34 36 40 25 33 24
## 39 40 41 42 43 44 45 46 47 48 49 50 51
## 27 23 27 20 25 37 30 36 19 34 30 32 16
## 52 53 54 55 56 57 58 59 60 61 62 63 64
## 22 22 30 33 26 25 28 26 38 21 21 18 19
## 65 66 67 68 69 70 71 72 73 74 75 76 77
## 23 17 34 27 30 27 31 29 25 23 22 20 22
## 78 79 80 81 82 83 84 85 86 87 88 89 90
## 15 30 27 31 29 24 26 30 22 37 25 26 31
## 91 92 93 94 95 96 97 98 99 100 101 102 103
## 27 32 32 25 31 35 25 35 35 35 22 43 41
## 104 105 106 107 108 109 110 111 112 113 114 115 116
## 31 34 37 42 43 47 52 50 42 58 55 62 72
## 117 118 119 120 121 122 123 124 125 126
## 82 75 71 95 97 134 175 263 467 11860
##This shows that there are 1172 sequences where all samples have an RPM of less than 1,
##and 11860 sequences which have an RPM of >= 1 in all 126 samples --> the same
##number we got for e_fil2. This matching of results is called a sanity check,
##which you should be doing often, i.e., making sure that the results you are getting
##are indeed correct by cross-checking them with some other method.
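The rowSums() trick above is easier to see on a tiny toy matrix:

```r
##2 genes x 3 samples of made-up RPM values
toy <- matrix(c(0, 2, 5,
                1, 1, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))
rowSums(toy >= 1) #g1 passes the RPM >= 1 cutoff in 2 samples, g2 in all 3
```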
e_fil3 <- as.data.frame(rowSums(RPM_morethan1) >= 38)
##keeping only sequences with 38 or more TRUE values (i.e. an RPM of >= 1 in at least 38 samples; 38 is ~30% of our 126 samples)
e_fil3 <- e_fil3 %>%
filter(.[1] == "TRUE")
e_fil3 <- eDat %>%
filter(rownames(eDat) %in% rownames(e_fil3))
#dim(e_fil3)
#19472 - 15754
#3718 sequences removed
melt_fil3 <- melt(e_fil3)
## No id variables; using all as measure variables
melt_fil3 <- melt_fil3 %>%
mutate(log_x_1 = log2(value + 1))
g_fil3 <- melt_fil3 %>%
ggplot(aes(log_x_1, color = variable, fill = variable)) +
geom_density(alpha = 0.1) +
theme_minimal() +
theme(legend.position = "none") + #has to come after specifying theme
labs(x = "log2(x+1) RPM", y = "Density", title = "Sample Distribution - Density Plot", subtitle = "Sequences with RPM >= 1 in 30% of samples", caption = "Raw counts")
g_fil3
##plot all density plots together
library(ggpubr)
ggarrange(g_fil1, g_fil2, g_fil3)
##let's now put the total number of sequences retained after
##each of the filtering steps into a separate dataframe
seq_counts <- data.frame(nrow(eDat))
seq_counts$fil_1 <- nrow(e_fil1)
seq_counts$fil_2 <- nrow(e_fil2)
seq_counts$fil_3 <- nrow(e_fil3)
colnames(seq_counts) <- c("All sequences", "Sequences with RPM > 0 in at least 1 sample", "Sequences with RPM >= 1 in all samples", "Sequences with RPM >= 1 in 30% of samples")
seq_counts <- as.data.frame(t(seq_counts))
seq_counts <- seq_counts %>%
rownames_to_column()
colnames(seq_counts)[1:2] <- c("filtering_criteria", "sequences")
seq_counts %>%
ggplot(aes(x = filtering_criteria, y = sequences, fill = filtering_criteria)) +
geom_bar(stat = "identity") +
theme_minimal()
seq_counts$filtering_criteria <- c("All sequences", "Sequences with\nRPM > 0 in at least\n1 sample", "Sequences with\nRPM >= 1 in\nall samples", "Sequences with\nRPM >= 1 in\n30% of samples")
seq_counts %>%
ggplot(aes(x = filtering_criteria, y = sequences, fill = filtering_criteria)) +
geom_bar(stat = "identity", width = 0.6) +
# scale_fill_viridis_d() +
# geom_text(aes(label = sequences), vjust = -0.8, colour = "#333333", position = position_dodge(0.65)) +
# scale_y_continuous(expand = expansion(mult = c(0, .09))) +
theme_minimal() +
# theme(legend.position = "none") +
labs(x = "Filtering Criteria", y = "No. of Sequences", title = "No. of sequences by filtering criteria")
The filtering step you choose depends upon the question you’re asking of your data:
We’re now going to apply the Relative Log Expression (RLE) method to normalize our data - an essential step that curbs the influence of outliers, so that highly-expressed genes are not overestimated and the expression distribution is closer to normal.
RLE accounts for between-sample variation, after which we’ll scale by RPM to account for within-sample variation.
library(edgeR, quietly = TRUE)
genes <- as.data.frame(row.names(e_fil2))
norm <- DGEList(counts = e_fil2, samples = pDat, genes = genes, group = pDat$COVID)
eNorm <- calcNormFactors(norm, method = "RLE")
eNorm <- cpm(eNorm)
eNorm <- as.data.frame(eNorm)
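The same DGEList -> calcNormFactors -> cpm pipeline, sketched on small synthetic counts (assumes edgeR is installed; the toy counts are made up):

```r
library(edgeR)
set.seed(1)
##5 genes x 4 samples of Poisson-distributed toy counts
toy_counts <- matrix(rpois(20, lambda = 50), nrow = 5,
                     dimnames = list(paste0("g", 1:5), paste0("s", 1:4)))
toy_dge <- DGEList(counts = toy_counts)
toy_dge <- calcNormFactors(toy_dge, method = "RLE")
cpm(toy_dge) #counts-per-million, scaled by the per-sample RLE normalization factors
```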
melt_norm <- melt(eNorm)
## No id variables; using all as measure variables
melt_norm %>%
ggplot(aes(log2(value), color = variable, fill = variable)) +
geom_density(alpha = 0.1) +
theme_minimal() +
theme(legend.position = "none") + #has to come after specifying theme
labs(x = "log2 RPM", y = "Density", title = "Sample Distribution - Density Plot: RLE Normalized Counts", subtitle = "Sequences with RPM >= 1 in all samples")
##however, now that we've normalized our data, some of the expression values
##have dropped below 1 (giving negative log2 values), as we can see in the plot.
##We'll perform the same x+1 transformation again
melt_norm <- melt_norm %>%
mutate(log_x_1 = log2(value + 1))
g2 <- melt_norm %>%
ggplot(aes(log_x_1, color = variable, fill = variable)) +
geom_density(alpha = 0.1) +
theme_minimal() +
theme(legend.position = "none") + #has to come after specifying theme
labs(x = "log2 (x+1) RPM", y = "Density", title = "Sample Distribution - Density Plot: RLE Normalized Counts", subtitle = "Sequences with RPM >= 1 in all samples")
g2
##Raw versus normalized counts
ggarrange(g1, g2)
Let’s now plot the raw and normalized expression counts of a random gene.
eDat <- eDat %>%
rownames_to_column(var = "gene")
random_sample <- eDat %>%
# rownames_to_column(var = "gene")
filter(gene == "TRIM62") %>%
column_to_rownames(var = "gene")
sample_from_eNorm <- eNorm %>%
rownames_to_column(var = "gene") %>%
filter(gene == "TRIM62") %>%
column_to_rownames(var = "gene")
row.names(random_sample)[1] <- "TRIM62_raw"
row.names(sample_from_eNorm)[1] <- "TRIM62_norm"
TRIM62 <- rbind(random_sample, sample_from_eNorm)
TRIM62 <- TRIM62 %>%
rownames_to_column(var = "TRIM62_value")
TRIM62 <- TRIM62 %>%
pivot_longer(cols = -c(TRIM62_value), names_to = "sample", values_to = "RPM")
TRIM62 %>%
ggplot(aes(x = sample, y = RPM, colour = TRIM62_value)) +
geom_point(size = 2) +
scale_colour_manual(values = c("forest green", "orange")) +
theme_classic() +
theme(legend.position = "bottom") +
labs(x = "Sample", y = "RPM", title = "Change in TRIM62 expression value before and after normalization", subtitle = "raw vs. RLE Normalized Counts")
We’ll now save our processed expression dataframe as a tab-delimited text file, which we can load in for Part 2.
eNorm <- eNorm %>%
rownames_to_column(var = "gene")
write_delim(eNorm, file = normalizePath("data/eNorm.txt"), delim = "\t")
End of Part 1